Load the functions
source("thesis_functions.R")
dhist.plot <- function(d, v, vname){
# pull the values of column v so the normal curve uses its mean and sd
vals <- dplyr::pull(d, {{ v }})
ggplot(d, aes(x = {{ v }})) +
geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
stat_function(
fun = dnorm,
args = list(mean = mean(vals), sd = sd(vals)),
lwd = 1,
col = 'red'
) +
xlab(vname) + ylab("Density") +
ggtitle(paste(vname, "Distribution")) +
theme_bw()
}
Set levels of the ordinal data, and convert the other variables to their appropriate data types. The date/time data types were not adjusted since they will not be used in this portion.
data <- fread("kpopdata.csv")
data <- mutate(data, ArtistType = as.factor(ArtistType),
ArtistGender = as.factor(ArtistGender),
ArtistGen = factor(ArtistGen),
release_date = as.POSIXct(release_date, format = "%m/%d/%y"),
key = factor(key, levels = 0:11),
mode = as.factor(mode),
time_signature = factor(time_signature, levels = c(1,3,4,5)))
#encode release month/year as a continuous month count since January 1992 (~360 months total)
ref.year = 1992 #we are using January 1992 as the reference
data$months = 12*(year(data$release_date) - ref.year) + month(data$release_date)
#View(select(data, Artist, song_name, release_date, months))
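As a sanity check, the month index can be computed by hand for a couple of dates (the dates below are illustrative examples, not rows from the data; this sketch uses only base R rather than the date helpers loaded above):

```r
# month index: months elapsed since January 1992, counting January 1992 as 1
month_index <- function(date, ref.year = 1992) {
  12 * (as.integer(format(date, "%Y")) - ref.year) +
    as.integer(format(date, "%m"))
}

month_index(as.Date("1992-01-15"))  # first month of the reference year -> 1
month_index(as.Date("2012-03-01"))  # 12*20 + 3 -> 243
```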
Testing how to invert the log transformation; the round trip recovers the original values.
data$months[1:10]
## [1] 243 243 243 243 243 243 243 240 345 309
log.trans <- log(360-data$months);log.trans[1:10]
## [1] 4.762174 4.762174 4.762174 4.762174 4.762174 4.762174 4.762174 4.787492
## [9] 2.708050 3.931826
back.trans <- 360 - exp(log.trans);back.trans[1:10]
## [1] 243 243 243 243 243 243 243 240 345 309
Summary of the data to gain an understanding of the range.
summary(data$months)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 241.0 285.0 266.3 317.0 349.0
Check the distribution of the months variable.
#dhist.plot(data, months, "Month Released")
ggplot(data, aes(x=months)) +
geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
stat_function(
fun = dnorm,
args = list(mean = mean(data$months), sd = sd(data$months)),
lwd = 1,
col = 'red'
) +
xlab("Month Released") + ylab("Density") +
ggtitle(paste("Month Released", "Distribution")) +
theme_bw()
#produce normal qqplot to assess normality
qqnorm(data$months, pch = 1, frame = FALSE)
qqline(data$months, col = "red", lwd = 2)
#Skewness score
skewness(data$months)
## [1] -1.342445
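skewness() here comes from a loaded package (likely e1071 or moments; thesis_functions.R is not shown). For intuition, a minimal base-R sketch of the moment-based sample skewness such functions compute:

```r
# sample skewness: third central moment over the cube of the (biased) sd
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3/2)
}

skew(c(1, 2, 3))        # symmetric sample -> 0
skew(c(1, 1, 1, 1, 10)) # long right tail -> positive
```

A negative value, as observed for months, indicates a long left tail.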
Next we try to find a transformation to normalize the data; as we can see, the distribution of months is severely skewed to the left.
A common transformation toward normality for moderately skewed data is \(log((max(y)+1) - y)\); since the maximum number of months is 349, the transformation is \(log(350 - y)\).
Summary of the data under this transformation:
summary(log(350-data$months))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.497 4.174 4.029 4.691 5.849
Already from the five-number summary it is evident that skewness exists, since the median is much closer to the maximum than to the minimum.
ggplot(data, aes(x=log(350-data$months))) +
geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
stat_function(
fun = dnorm,
args = list(mean = mean(log(350-data$months)), sd = sd(log(350-data$months))),
lwd = 1,
col = 'red'
) +
xlab("Month Released : log(350 - y)") + ylab("Density") +
ggtitle(paste("Month Released : log(350 - y)", "Distribution")) +
theme_bw()
#produce normal qqplot to assess normality
qqnorm(log(350-data$months), pch = 1, frame = FALSE)
qqline(log(350-data$months), col = "red", lwd = 2)
#Skewness score
skewness(log(350-data$months))
## [1] -0.8029814
The distribution looks more normal, but still has noticeable skewness. The transformation has therefore not succeeded in achieving normality, as the observations deviate too much from the normal QQ line.
Since issues of skewness remain, we tweak the standard transformation for negative skewness to this alternative: \(log(360 - y)\)
Summary of the data under this transformation:
summary(log(360-data$months))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.398 3.761 4.317 4.269 4.779 5.878
Notice that the maximum is still very similar to before, but the minimum has increased with this change, and the median now sits at a more central location than before.
ggplot(data, aes(x=log(360-data$months))) +
geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
stat_function(
fun = dnorm,
args = list(mean = mean(log(360-data$months)), sd = sd(log(360-data$months))),
lwd = 1,
col = 'red'
) +
xlab("Month Released : log(360 - y)") + ylab("Density") +
ggtitle(paste("Month Released : log(360 - y)", "Distribution")) +
theme_bw()
#produce normal qqplot to assess normality
qqnorm(log(360-data$months), pch = 1, frame = FALSE)
qqline(log(360-data$months), col = "red", lwd = 2)
#calculate skewness
skewness(log(360-data$months))
## [1] -0.2249134
The distribution of months has greatly improved and looks roughly normal. Furthermore, the normal QQ plot shows great improvement.
Another common transformation for moderately skewed data is a square root approach: \(\sqrt{(max(y)+1) - y}\) = \(\sqrt{350 - y}\)
summary(sqrt(350-data$months))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.745 8.062 8.382 10.440 18.628
ggplot(data, aes(x=sqrt(350-data$months))) +
geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
stat_function(
fun = dnorm,
args = list(mean = mean(sqrt(350-data$months)), sd = sd(sqrt(350-data$months))),
lwd = 1,
col = 'red'
) +
xlab("Month Released : sqrt(350 - y)") + ylab("Density") +
ggtitle(paste("Month Released : sqrt(350 - y)", "Distribution")) +
theme_bw()
#produce normal qqplot to assess normality
qqnorm(sqrt(350-data$months), pch = 1, frame = FALSE)
qqline(sqrt(350-data$months), col = "red", lwd = 2)
#calculate skewness
skewness(sqrt(350-data$months))
## [1] 0.433938
It is still somewhat positively skewed, but much closer to normal.
Lastly, a common transformation for severely negatively skewed data takes an inverse approach: \(\frac{1}{(max(y)+1) - y}\) = \(\frac{1}{350 - y}\)
ggplot(data, aes(x=1/(350-data$months))) +
geom_histogram(aes(y = stat(density)), bins = 30, fill = "white", col = "black") +
stat_function(
fun = dnorm,
args = list(mean = mean(1/(350-data$months)), sd = sd(1/(350-data$months))),
lwd = 1,
col = 'red'
) +
xlab("Month Released : 1/(350 - y)") + ylab("Density") +
ggtitle(paste("Month Released : 1/(350 - y)", "Distribution")) +
theme_bw()
This transformation badly distorts the distribution; it is clearly unsuitable and will not be pursued.
Create training (75%) and test (25%) data to fit the regression models on.
kpop <- dplyr::select(data, months, popularity, duration, acousticness, danceability, energy, instrumentalness, key, loudness, mode, speechiness, tempo, valence)
kpop0 <- kpop %>% dplyr::filter(mode == 0) %>% dplyr::select(-mode)
kpop1 <- kpop %>% dplyr::filter(mode == 1) %>% dplyr::select(-mode)
### Kpop mode 0 train and test
smpl.size0 <- floor(0.75*nrow(kpop0))
set.seed(123)
smpl0 <- sample(nrow(kpop0), smpl.size0, replace = FALSE)
train0 <- kpop0[smpl0,]
test0 <- kpop0[-smpl0,]
### Kpop mode 1 train and test
smpl.size1 <- floor(0.75*nrow(kpop1))
set.seed(123)
smpl1 <- sample(nrow(kpop1), smpl.size1, replace = FALSE)
train1 <- kpop1[smpl1,]
test1 <- kpop1[-smpl1,]
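The split logic can be verified on toy data (toy size of 100 rows, not the actual kpop0/kpop1 dimensions): sample() without replacement yields disjoint, exhaustive train/test index sets.

```r
set.seed(123)
n <- 100                              # toy number of rows
smpl.size <- floor(0.75 * n)          # 75% of rows go to training
smpl <- sample(n, smpl.size, replace = FALSE)

length(smpl)                          # 75 training rows
length(setdiff(1:n, smpl))            # 25 test rows
length(intersect(smpl, setdiff(1:n, smpl)))  # 0: train and test do not overlap
```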
ml0 <- lm(months ~. , data = train0)
summary(ml0)
##
## Call:
## lm(formula = months ~ ., data = train0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -220.905 -29.308 6.643 37.204 199.560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 506.91441 16.88344 30.024 < 2e-16 ***
## popularity 1.60481 0.05435 29.528 < 2e-16 ***
## duration -20.26179 1.81712 -11.151 < 2e-16 ***
## acousticness 31.41991 6.58842 4.769 1.93e-06 ***
## danceability -61.87134 10.53878 -5.871 4.74e-09 ***
## energy -98.67824 10.85541 -9.090 < 2e-16 ***
## instrumentalness 69.35253 14.06277 4.932 8.54e-07 ***
## key1 -3.38890 4.81002 -0.705 0.481137
## key2 -8.25574 5.95496 -1.386 0.165724
## key3 -4.68988 6.66026 -0.704 0.481381
## key4 -16.64198 4.72042 -3.526 0.000428 ***
## key5 -6.37892 4.57708 -1.394 0.163508
## key6 -7.69000 4.77927 -1.609 0.107700
## key7 -3.91007 5.23715 -0.747 0.455354
## key8 -1.77216 5.65747 -0.313 0.754115
## key9 -16.31280 4.85082 -3.363 0.000780 ***
## key10 -13.05634 4.99295 -2.615 0.008962 **
## key11 -13.43026 4.45880 -3.012 0.002613 **
## loudness 14.12750 0.67833 20.827 < 2e-16 ***
## speechiness 39.57966 15.28264 2.590 0.009642 **
## tempo -0.12300 0.04024 -3.057 0.002255 **
## valence -25.15503 5.77332 -4.357 1.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.61 on 3482 degrees of freedom
## Multiple R-squared: 0.4143, Adjusted R-squared: 0.4108
## F-statistic: 117.3 on 21 and 3482 DF, p-value: < 2.2e-16
check assumptions with diagnostic plots
#par(mfrow = c(4,1))
plot(ml0)
Residuals vs. Fitted: does not show a horizontal line; there is a noticeable slight curve in the trend.
Normal QQ: residual points roughly follow the normal QQ line. The left side falls below the line beyond theoretical quantile -2; otherwise the distribution of the residuals appears roughly normal.
Scale-Location: rather than a flat horizontal line of equally spread points, the line clearly decreases, so there is clear evidence that homogeneity of residual variance is violated. We may need to refit the model on a transformation of the outcome variable, the month of song release.
qqnorm(ml0$residuals)
qqline(ml0$residuals, col = "red")
checking for multicollinearity
car::vif(ml0)
## GVIF Df GVIF^(1/(2*Df))
## popularity 1.166795 1 1.080183
## duration 1.095607 1 1.046713
## acousticness 1.572984 1 1.254187
## danceability 1.463926 1 1.209928
## energy 2.625882 1 1.620457
## instrumentalness 1.062607 1 1.030828
## key 1.085836 11 1.003750
## loudness 1.974944 1 1.405327
## speechiness 1.078396 1 1.038458
## tempo 1.112227 1 1.054622
## valence 1.540583 1 1.241202
Overall, the current model does not meet the assumptions of the linear model. We need to transform the response variable to achieve normality of the residuals.
yhat.ml0 = predict(ml0, newdata = test0)
data.frame(
RMSE = RMSE(yhat.ml0, test0$months),
R2 = R2(yhat.ml0, test0$months)
)
## RMSE R2
## 1 56.76951 0.3780801
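RMSE() and R2() here come from the caret package; a minimal base-R sketch of the quantities they compute (caret's R2 defaults to the squared correlation between predictions and observations):

```r
# root mean squared prediction error
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
# squared correlation between predictions and observations
r2 <- function(pred, obs) cor(pred, obs)^2

pred <- c(1.0, 2.0, 3.0); obs <- c(1.1, 1.9, 3.2)
rmse(pred, obs)  # sqrt(mean(c(0.01, 0.01, 0.04))) = sqrt(0.02)
r2(pred, obs)
```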
boxcox(ml0,lambda = seq(1.7, 3, 0.1), plotit = TRUE)
Optimal transformation is for \(\lambda = 2.3\)
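bc.transform() is defined in thesis_functions.R (not shown here). Assuming it is the standard Box-Cox transform \((y^\lambda - 1)/\lambda\), the reported minimum of 5.01 is reproduced exactly from the minimum month index of 3:

```r
# standard Box-Cox transform; assumed form of bc.transform from thesis_functions.R
bc.transform <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

bc.transform(3, 2.3)  # minimum month index 3 -> ~5.01
```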
summary(bc.transform(data$months, 2.3))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.01 130891.04 192492.74 180682.07 245871.22 306739.73
hist(bc.transform(data$months, 2.3))
ml0.bc <- lm(bc.transform(months, 2.3) ~. , data = train0)
summary(ml0.bc)
##
## Call:
## lm(formula = bc.transform(months, 2.3) ~ ., data = train0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -182515 -45410 3662 45295 209577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 387391.90 19370.62 19.999 < 2e-16 ***
## popularity 2137.24 62.36 34.275 < 2e-16 ***
## duration -22523.94 2084.80 -10.804 < 2e-16 ***
## acousticness 33787.57 7558.99 4.470 8.08e-06 ***
## danceability -61484.25 12091.30 -5.085 3.87e-07 ***
## energy -72490.65 12454.57 -5.820 6.40e-09 ***
## instrumentalness 79343.44 16134.42 4.918 9.16e-07 ***
## key1 -1387.89 5518.61 -0.251 0.80145
## key2 -5870.47 6832.21 -0.859 0.39027
## key3 -6175.19 7641.41 -0.808 0.41908
## key4 -15890.01 5415.81 -2.934 0.00337 **
## key5 -4046.15 5251.36 -0.770 0.44106
## key6 -6788.43 5483.33 -1.238 0.21580
## key7 670.53 6008.66 0.112 0.91115
## key8 661.37 6490.90 0.102 0.91885
## key9 -15530.42 5565.41 -2.791 0.00529 **
## key10 -15579.34 5728.48 -2.720 0.00657 **
## key11 -12658.78 5115.65 -2.475 0.01339 *
## loudness 11349.58 778.26 14.583 < 2e-16 ***
## speechiness 37433.06 17534.00 2.135 0.03284 *
## tempo -128.82 46.17 -2.790 0.00530 **
## valence -33700.33 6623.82 -5.088 3.81e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63800 on 3482 degrees of freedom
## Multiple R-squared: 0.4159, Adjusted R-squared: 0.4124
## F-statistic: 118.1 on 21 and 3482 DF, p-value: < 2.2e-16
check diagnostic plots
plot(ml0.bc)
Normality of the residuals has improved, but the other diagnostics have not.
yhat.ml0.bc = predict(ml0.bc, newdata = test0 %>% mutate(months = bc.transform(test0$months, 2.3)))
data.frame(
RMSE = RMSE(yhat.ml0.bc, bc.transform(test0$months, 2.3)),
R2 = R2(yhat.ml0.bc, bc.transform(test0$months, 2.3))
)
## RMSE R2
## 1 64627.64 0.3899321
hist(log(360-train0$months))
summary(log(360-train0$months))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.398 3.732 4.304 4.261 4.796 5.878
ml0.log360 <- lm(log(360-train0$months) ~. , data = train0)
summary(ml0.log360)
##
## Call:
## lm(formula = log(360 - train0$months) ~ ., data = train0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.39175 -0.40146 0.04915 0.44693 1.61319
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7389301 0.1848139 14.820 < 2e-16 ***
## popularity -0.0215638 0.0005949 -36.246 < 2e-16 ***
## duration 0.1940970 0.0198910 9.758 < 2e-16 ***
## acousticness -0.2401500 0.0721199 -3.330 0.000878 ***
## danceability 0.4237910 0.1153624 3.674 0.000243 ***
## energy 0.5720095 0.1188284 4.814 1.54e-06 ***
## instrumentalness -0.8481059 0.1539376 -5.509 3.86e-08 ***
## key1 -0.0127793 0.0526528 -0.243 0.808245
## key2 0.0157264 0.0651857 0.241 0.809372
## key3 0.0354738 0.0729063 0.487 0.626596
## key4 0.1181835 0.0516720 2.287 0.022245 *
## key5 0.0170423 0.0501029 0.340 0.733768
## key6 0.0505054 0.0523161 0.965 0.334417
## key7 -0.0435365 0.0573283 -0.759 0.447650
## key8 -0.0555081 0.0619293 -0.896 0.370147
## key9 0.1087090 0.0530993 2.047 0.040706 *
## key10 0.1264746 0.0546551 2.314 0.020723 *
## key11 0.1012854 0.0488081 2.075 0.038044 *
## loudness -0.0871796 0.0074253 -11.741 < 2e-16 ***
## speechiness -0.3296733 0.1672909 -1.971 0.048842 *
## tempo 0.0009236 0.0004405 2.097 0.036072 *
## valence 0.3243115 0.0631975 5.132 3.03e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6087 on 3482 degrees of freedom
## Multiple R-squared: 0.4105, Adjusted R-squared: 0.4069
## F-statistic: 115.4 on 21 and 3482 DF, p-value: < 2.2e-16
check diagnostic plots
plot(ml0.log360)
Predictions and performance diagnostics
yhat.ml0.log360 = predict(ml0.log360, newdata = test0 %>% mutate(months = log(360 - test0$months)))
data.frame(
RMSE = RMSE(yhat.ml0.log360, log(360 - test0$months)),
R2 = R2(yhat.ml0.log360, log(360 - test0$months))
)
## RMSE R2
## 1 0.5976752 0.3932124
hist(sqrt(350-train0$months))
summary(sqrt(350-train0$months))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.635 8.000 8.377 10.536 18.628
ml0.sqrt350 <- lm(sqrt(350-train0$months) ~. , data = train0)
summary(ml0.sqrt350)
##
## Call:
## lm(formula = sqrt(350 - train0$months) ~ ., data = train0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4296 -2.0157 -0.0706 2.0165 9.4299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.352232 0.877636 -1.541 0.12346
## popularity -0.097745 0.002825 -34.598 < 2e-16 ***
## duration 1.011761 0.094457 10.711 < 2e-16 ***
## acousticness -1.411834 0.342480 -4.122 3.84e-05 ***
## danceability 2.630235 0.547828 4.801 1.64e-06 ***
## energy 3.783006 0.564287 6.704 2.36e-11 ***
## instrumentalness -3.941710 0.731012 -5.392 7.43e-08 ***
## key1 0.040263 0.250035 0.161 0.87208
## key2 0.235396 0.309551 0.760 0.44704
## key3 0.212147 0.346214 0.613 0.54007
## key4 0.712502 0.245377 2.904 0.00371 **
## key5 0.187106 0.237926 0.786 0.43169
## key6 0.315464 0.248436 1.270 0.20424
## key7 -0.036510 0.272238 -0.134 0.89332
## key8 -0.112283 0.294087 -0.382 0.70263
## key9 0.678401 0.252156 2.690 0.00717 **
## key10 0.660914 0.259544 2.546 0.01093 *
## key11 0.589000 0.231778 2.541 0.01109 *
## loudness -0.560716 0.035261 -15.902 < 2e-16 ***
## speechiness -1.819528 0.794423 -2.290 0.02206 *
## tempo 0.005484 0.002092 2.622 0.00879 **
## valence 1.494620 0.300109 4.980 6.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.89 on 3482 degrees of freedom
## Multiple R-squared: 0.4249, Adjusted R-squared: 0.4214
## F-statistic: 122.5 on 21 and 3482 DF, p-value: < 2.2e-16
check diagnostic plots
plot(ml0.sqrt350)
Predictions and performance diagnostics
yhat.ml0.sqrt350 = predict(ml0.sqrt350, newdata = test0 %>% dplyr::mutate(months = sqrt(350-test0$months)))
data.frame(
RMSE = RMSE(yhat.ml0.sqrt350, sqrt(350-test0$months)),
R2 = R2(yhat.ml0.sqrt350, sqrt(350-test0$months))
)
## RMSE R2
## 1 2.894936 0.3988485
train0$months <- log(360 - train0$months)
test0$months <- log(360 - test0$months)
train1$months <- log(360 - train1$months)
test1$months <- log(360 - test1$months)
mlr0 <- lm(months ~. , data = train0)
assess model assumptions
plot(mlr0)
view model
summary(mlr0)
##
## Call:
## lm(formula = months ~ ., data = train0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.39175 -0.40146 0.04915 0.44693 1.61319
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7389301 0.1848139 14.820 < 2e-16 ***
## popularity -0.0215638 0.0005949 -36.246 < 2e-16 ***
## duration 0.1940970 0.0198910 9.758 < 2e-16 ***
## acousticness -0.2401500 0.0721199 -3.330 0.000878 ***
## danceability 0.4237910 0.1153624 3.674 0.000243 ***
## energy 0.5720095 0.1188284 4.814 1.54e-06 ***
## instrumentalness -0.8481059 0.1539376 -5.509 3.86e-08 ***
## key1 -0.0127793 0.0526528 -0.243 0.808245
## key2 0.0157264 0.0651857 0.241 0.809372
## key3 0.0354738 0.0729063 0.487 0.626596
## key4 0.1181835 0.0516720 2.287 0.022245 *
## key5 0.0170423 0.0501029 0.340 0.733768
## key6 0.0505054 0.0523161 0.965 0.334417
## key7 -0.0435365 0.0573283 -0.759 0.447650
## key8 -0.0555081 0.0619293 -0.896 0.370147
## key9 0.1087090 0.0530993 2.047 0.040706 *
## key10 0.1264746 0.0546551 2.314 0.020723 *
## key11 0.1012854 0.0488081 2.075 0.038044 *
## loudness -0.0871796 0.0074253 -11.741 < 2e-16 ***
## speechiness -0.3296733 0.1672909 -1.971 0.048842 *
## tempo 0.0009236 0.0004405 2.097 0.036072 *
## valence 0.3243115 0.0631975 5.132 3.03e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6087 on 3482 degrees of freedom
## Multiple R-squared: 0.4105, Adjusted R-squared: 0.4069
## F-statistic: 115.4 on 21 and 3482 DF, p-value: < 2.2e-16
All variables except Key1, Key2, Key3, Key5, Key6, Key7, and Key8 are significant to the model in predicting the month of song release.
The Full Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 2.739 - 0.022{popularity} + 0.194{duration} - 0.240{acousticness} \\ + 0.424{danceability} + 0.572{energy} - 0.848{instrumentalness} \\ - 0.013{key1} + 0.016{key2} + 0.035{key3} + 0.118{key4} + 0.017{key5} + 0.051{key6} \\ - 0.044{key7} -0.056{key8} + 0.109{key9} + 0.126{key10} + 0.101{key11} \\ - 0.087{loudness} - 0.330{speechiness} + 0.001{tempo} + 0.324{valence} \]
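Since the model predicts \(log(360 - months)\), a fitted value must be back-transformed to the month scale via \(\widehat{months} = 360 - e^{\hat{y}}\). A round-trip check (the month value 243 is just an illustrative input, not a model prediction):

```r
y <- 243                       # illustrative month index
pred <- log(360 - y)           # the scale the model predicts on
months_hat <- 360 - exp(pred)  # back-transform to months
months_hat                     # recovers 243
```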
# prediction on test data
yhat.mlr0 = predict(mlr0, newdata=test0)
# RMSE for test data
error.mlr0 <- yhat.mlr0 - test0$months
rmse.mlr0 <- sqrt(mean(error.mlr0^2))
data.frame(
RMSE = RMSE(yhat.mlr0, test0$months),
R2 = R2(yhat.mlr0, test0$months)
)
## RMSE R2
## 1 0.5976752 0.3932124
set.seed(123)
train_control <- trainControl(method = "cv",
number = 10)
bye <- capture.output(mlr.step_kcv0 <- train(months ~. , data = train0,
method = "lmStepAIC", trControl = train_control))
print(mlr.step_kcv0)
## Linear Regression with Stepwise Selection
##
## 3504 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3153, 3153, 3153, 3154, 3155, 3153, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.6115382 0.4023572 0.4944596
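trainControl(method = "cv", number = 10) makes caret estimate performance by 10-fold cross-validation. The resampling loop it runs is equivalent to this base-R sketch (on hypothetical toy data, not the kpop variables):

```r
set.seed(123)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)                      # toy data with known slope

folds <- sample(rep(1:10, length.out = nrow(d))) # assign each row to a fold
cv.rmse <- sapply(1:10, function(k) {
  fit <- lm(y ~ x, data = d[folds != k, ])       # fit on the other 9 folds
  pred <- predict(fit, newdata = d[folds == k, ])# predict the held-out fold
  sqrt(mean((pred - d$y[folds == k])^2))
})
mean(cv.rmse)  # cross-validated RMSE, analogous to caret's reported RMSE
```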
mlr.step_kcv0$finalModel
##
## Call:
## lm(formula = .outcome ~ popularity + duration + acousticness +
## danceability + energy + instrumentalness + key4 + key6 +
## key9 + key10 + key11 + loudness + speechiness + tempo + valence,
## data = dat)
##
## Coefficients:
## (Intercept) popularity duration acousticness
## 2.7327797 -0.0215505 0.1933518 -0.2409479
## danceability energy instrumentalness key4
## 0.4254183 0.5722443 -0.8523074 0.1245219
## key6 key9 key10 key11
## 0.0567355 0.1150826 0.1327266 0.1074954
## loudness speechiness tempo valence
## -0.0870517 -0.3278520 0.0009343 0.3244776
The Stepwise Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 2.732 - 0.022{popularity} + 0.193{duration} - 0.241{acousticness} \\ + 0.425{danceability} + 0.572{energy} - 0.852{instrumentalness} \\ + 0.125{key4} + 0.057{key6} \\ + 0.115{key9} + 0.133{key10} + 0.107{key11} \\ - 0.087{loudness} - 0.328{speechiness} + 0.001{tempo} + 0.324{valence} \]
Compared to the full multiple linear regression model, the stepwise regression model omits the variables: Key1, Key2, Key3, Key5, Key7, and Key8.
prediction on test data
# prediction on test data
yhat.step_kcv0 = predict(mlr.step_kcv0$finalModel, newdata=key.dummy(test0))
# RMSE for test data
error.step_kcv0 <- yhat.step_kcv0 - test0$months
rmse.step_kcv0 <- sqrt(mean(error.step_kcv0^2))
data.frame(
RMSE = RMSE(yhat.step_kcv0, test0$months),
R2 = R2(yhat.step_kcv0, test0$months)
)
## RMSE R2
## 1 0.5980156 0.3925419
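key.dummy() comes from thesis_functions.R (not shown). Because the stepwise final model was fit on caret's expanded design matrix, prediction requires the key factor expanded into key1-key11 indicator columns; a minimal sketch of what such a helper presumably does:

```r
# expand the 12-level key factor into 0/1 dummy columns key1..key11
# (key 0 is the reference level), keeping all other columns as-is
key.dummy <- function(d) {
  dummies <- model.matrix(~ key, data = d)[, -1, drop = FALSE]  # drop intercept
  cbind(d[, setdiff(names(d), "key"), drop = FALSE], as.data.frame(dummies))
}

d <- data.frame(key = factor(c(0, 1, 4), levels = 0:11), y = 1:3)
key.dummy(d)$key4  # 0 0 1
```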
allreg.models0 <- regsubsets(months ~., data = train0, nvmax = 21)
summary(allreg.models0)
## Subset selection object
## Call: regsubsets.formula(months ~ ., data = train0, nvmax = 21)
## 21 Variables (and intercept)
## Forced in Forced out
## popularity FALSE FALSE
## duration FALSE FALSE
## acousticness FALSE FALSE
## danceability FALSE FALSE
## energy FALSE FALSE
## instrumentalness FALSE FALSE
## key1 FALSE FALSE
## key2 FALSE FALSE
## key3 FALSE FALSE
## key4 FALSE FALSE
## key5 FALSE FALSE
## key6 FALSE FALSE
## key7 FALSE FALSE
## key8 FALSE FALSE
## key9 FALSE FALSE
## key10 FALSE FALSE
## key11 FALSE FALSE
## loudness FALSE FALSE
## speechiness FALSE FALSE
## tempo FALSE FALSE
## valence FALSE FALSE
## 1 subsets of each size up to 21
## Selection Algorithm: exhaustive
## popularity duration acousticness danceability energy instrumentalness
## 1 ( 1 ) "*" " " " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " "
## 3 ( 1 ) "*" "*" " " "*" " " " "
## 4 ( 1 ) "*" "*" " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " "*" " "
## 6 ( 1 ) "*" "*" " " " " "*" "*"
## 7 ( 1 ) "*" "*" " " "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*"
## 20 ( 1 ) "*" "*" "*" "*" "*" "*"
## 21 ( 1 ) "*" "*" "*" "*" "*" "*"
## key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 loudness
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 7 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 8 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 9 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*"
## 10 ( 1 ) " " " " " " " " " " " " "*" "*" " " " " " " "*"
## 11 ( 1 ) " " " " " " "*" " " " " " " " " " " "*" "*" "*"
## 12 ( 1 ) " " " " " " "*" " " " " " " " " "*" "*" "*" "*"
## 13 ( 1 ) " " " " " " "*" " " " " " " " " "*" "*" "*" "*"
## 14 ( 1 ) " " " " " " "*" " " " " " " " " "*" "*" "*" "*"
## 15 ( 1 ) " " " " " " "*" " " "*" " " " " "*" "*" "*" "*"
## 16 ( 1 ) " " " " " " "*" " " "*" " " "*" "*" "*" "*" "*"
## 17 ( 1 ) " " " " " " "*" " " "*" "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" " " " " "*" " " "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 20 ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 21 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## speechiness tempo valence
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " "*"
## 5 ( 1 ) " " " " "*"
## 6 ( 1 ) " " " " "*"
## 7 ( 1 ) " " " " "*"
## 8 ( 1 ) " " " " "*"
## 9 ( 1 ) " " " " "*"
## 10 ( 1 ) " " " " "*"
## 11 ( 1 ) " " " " "*"
## 12 ( 1 ) " " " " "*"
## 13 ( 1 ) " " "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
## 18 ( 1 ) "*" "*" "*"
## 19 ( 1 ) "*" "*" "*"
## 20 ( 1 ) "*" "*" "*"
## 21 ( 1 ) "*" "*" "*"
Assess models
allreg.res0 <- summary(allreg.models0)
allreg.compare0 <- data.frame(model = c(1:21),
Adj.R2 = allreg.res0$adjr2,
CP = allreg.res0$cp)
allreg.compare0
## model Adj.R2 CP
## 1 1 0.3462533 360.15498
## 2 2 0.3591335 285.02115
## 3 3 0.3707186 217.57361
## 4 4 0.3838071 141.29567
## 5 5 0.3921850 92.84423
## 6 6 0.3971156 64.74779
## 7 7 0.4005655 45.39542
## 8 8 0.4025885 34.46326
## 9 9 0.4033295 31.09094
## 10 10 0.4041639 27.17060
## 11 11 0.4050357 23.03289
## 12 12 0.4061457 17.49622
## 13 13 0.4066369 15.60448
## 14 14 0.4071063 13.84283
## 15 15 0.4073589 13.35758
## 16 16 0.4074023 14.10287
## 17 17 0.4074617 14.75469
## 18 18 0.4073768 16.25484
## 19 19 0.4072293 18.12198
## 20 20 0.4070700 20.05820
## 21 21 0.4069096 22.00000
Model number 15 has the smallest CP value (13.36), and its Adjusted R2 (0.4074) is essentially tied with the greatest (model 17, 0.4075), so model 15 is selected.
mlr.allreg0 <- lm(months ~ popularity +duration +acousticness +danceability +energy +instrumentalness+
key4+key6+key9 +key10 +key11 +loudness +speechiness+ tempo +valence, data = key.dummy(train0))
summary(mlr.allreg0)
##
## Call:
## lm(formula = months ~ popularity + duration + acousticness +
## danceability + energy + instrumentalness + key4 + key6 +
## key9 + key10 + key11 + loudness + speechiness + tempo + valence,
## data = key.dummy(train0))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.38943 -0.40057 0.04912 0.44462 1.61241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7327797 0.1802230 15.163 < 2e-16 ***
## popularity -0.0215505 0.0005940 -36.280 < 2e-16 ***
## duration 0.1933518 0.0198500 9.741 < 2e-16 ***
## acousticness -0.2409479 0.0719685 -3.348 0.000823 ***
## danceability 0.4254183 0.1152193 3.692 0.000226 ***
## energy 0.5722443 0.1185570 4.827 1.45e-06 ***
## instrumentalness -0.8523074 0.1537756 -5.543 3.20e-08 ***
## key4 0.1245219 0.0351732 3.540 0.000405 ***
## key6 0.0567355 0.0359754 1.577 0.114871
## key9 0.1150826 0.0371718 3.096 0.001977 **
## key10 0.1327266 0.0392897 3.378 0.000738 ***
## key11 0.1074954 0.0307223 3.499 0.000473 ***
## loudness -0.0870517 0.0074155 -11.739 < 2e-16 ***
## speechiness -0.3278520 0.1670495 -1.963 0.049772 *
## tempo 0.0009343 0.0004402 2.122 0.033869 *
## valence 0.3244776 0.0629704 5.153 2.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6085 on 3488 degrees of freedom
## Multiple R-squared: 0.4099, Adjusted R-squared: 0.4074
## F-statistic: 161.5 on 15 and 3488 DF, p-value: < 2.2e-16
prediction on test data
# prediction on test data
yhat.allreg0 = predict(mlr.allreg0, newdata=key.dummy(test0))
# RMSE for test data
# error.allreg0 <- yhat.allreg0 - test0$months
# rmse.allreg0 <- sqrt(mean(error.allreg0^2))
data.frame(
RMSE = RMSE(yhat.allreg0, test0$months),
R2 = R2(yhat.allreg0, test0$months)
)
## RMSE R2
## 1 0.5980156 0.3925419
lm.ridge0 <- train(months ~. , data = train0, method = "glmnet",
trControl = train_control,
tuneGrid = expand.grid(alpha = 0,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.ridge0$bestTune
## alpha lambda
## 45 0 0.045
# best coefficient
lm.ridge0.model <- coef(lm.ridge0$finalModel, lm.ridge0$bestTune$lambda)
lm.ridge0.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.8657735959
## popularity -0.0205267725
## duration 0.1899716805
## acousticness -0.2445830342
## danceability 0.4128163847
## energy 0.4823365034
## instrumentalness -0.7491778746
## key1 -0.0342147260
## key2 -0.0056979626
## key3 0.0075189682
## key4 0.0953873860
## key5 -0.0072177244
## key6 0.0225877475
## key7 -0.0627117017
## key8 -0.0749979230
## key9 0.0823586421
## key10 0.0986420788
## key11 0.0783989653
## loudness -0.0798794742
## speechiness -0.3084093591
## tempo 0.0008529497
## valence 0.3234950755
The Ridge Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 2.866 - 0.021{popularity} + 0.190{duration} - 0.245{acousticness} \\ + 0.413{danceability} + 0.482{energy} - 0.749{instrumentalness} \\ - 0.034{key1} - 0.006{key2} + 0.008{key3} + 0.095{key4} - 0.007{key5} + 0.023{key6} \\ - 0.063{key7} - 0.075{key8} + 0.082{key9} + 0.099{key10} + 0.078{key11} \\ - 0.080{loudness} - 0.308{speechiness} + 0.001{tempo} + 0.323{valence} \]
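glmnet fits ridge regression by penalized optimization; for intuition, the ridge estimate also has the closed form \(\hat\beta = (X^TX + \lambda I)^{-1}X^Ty\), which shrinks coefficients toward zero without setting any exactly to zero. A small illustrative sketch (not the glmnet parameterization, which standardizes the predictors and scales \(\lambda\) differently):

```r
# closed-form ridge estimate (penalizing all columns including the intercept
# for simplicity; textbook treatments usually leave the intercept unpenalized)
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

X <- cbind(1, 1:5)
y <- 2 * (1:5)        # exact linear fit: intercept 0, slope 2
ridge_coef(X, y, 0)   # lambda = 0 reproduces ordinary least squares
ridge_coef(X, y, 10)  # positive lambda shrinks the slope (here to 1.6)
```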
# prediction on test data
yhat.ridge0 = predict(lm.ridge0, s=lm.ridge0$bestTune,newdata=test0)
# RMSE for test data
data.frame(
RMSE = RMSE(yhat.ridge0, test0$months),
R2 = R2(yhat.ridge0, test0$months)
)
## RMSE R2
## 1 0.5981751 0.3921513
lm.lasso0 <- train(months ~. , data = train0, method = "glmnet",
trControl = train_control,
tuneGrid = expand.grid(alpha = 1,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.lasso0$bestTune
## alpha lambda
## 2 1 0.002
# best coefficient
lm.lasso0.model <- coef(lm.lasso0$finalModel, lm.lasso0$bestTune$lambda)
lm.lasso0.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.8365852962
## popularity -0.0215614447
## duration 0.1908720137
## acousticness -0.2395615122
## danceability 0.4088128971
## energy 0.5333749094
## instrumentalness -0.8078442972
## key1 -0.0270821553
## key2 .
## key3 0.0040833394
## key4 0.0923273187
## key5 .
## key6 0.0238454868
## key7 -0.0555632703
## key8 -0.0676519346
## key9 0.0819159291
## key10 0.0992058126
## key11 0.0770663223
## loudness -0.0839496407
## speechiness -0.2803564060
## tempo 0.0008109799
## valence 0.3212503863
The Lasso Multiple Linear Regression model for mode 0 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 2.837 - 0.022{popularity} + 0.191{duration} - 0.240{acousticness} \\ + 0.409{danceability} + 0.533{energy} - 0.808{instrumentalness} \\ - 0.027{key1} + 0.004{key3} + 0.092{key4} + 0.024{key6} \\ - 0.056{key7} - 0.068{key8} + 0.082{key9} + 0.099{key10} + 0.077{key11} \\ - 0.084{loudness} - 0.280{speechiness} + 0.001{tempo} + 0.321{valence} \]
Compared to the full model, the variables key2 and key5 are not kept in the model.
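The exact zeros in the lasso fit come from the soft-thresholding step at the heart of the coordinate-descent updates glmnet uses: coefficients whose partial-residual correlation falls below the penalty are set exactly to zero rather than merely shrunk. A minimal sketch of the operator (not of the full algorithm):

```r
# soft-thresholding operator S(z, g) = sign(z) * max(|z| - g, 0)
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

soft_threshold(0.50, 0.2)  # shrunk toward zero: 0.3
soft_threshold(-0.10, 0.2) # below the threshold: exactly 0
```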
# prediction on test data
yhat.lasso0 = predict(lm.lasso0, s=lm.lasso0$bestTune,newdata=test0)
# RMSE for test data
error.lasso0 <- yhat.lasso0 - test0$months
rmse.lasso0 <- sqrt(mean(error.lasso0^2))
data.frame(
RMSE = RMSE(yhat.lasso0, test0$months),
R2 = R2(yhat.lasso0, test0$months)
)
## RMSE R2
## 1 0.5975859 0.3932411
results0 <- data.frame(Model = c("FullMLR", "Stepwise", "AllSubsets","Ridge", "Lasso"),
RMSE = c(0.5976752 , 0.5980156, 0.5980156,0.5981751,0.5975859),
R2 = c(0.3932124, 0.3925419, 0.3925419,0.3921513,0.3932411))
results0
## Model RMSE R2
## 1 FullMLR 0.5976752 0.3932124
## 2 Stepwise 0.5980156 0.3925419
## 3 AllSubsets 0.5980156 0.3925419
## 4 Ridge 0.5981751 0.3921513
## 5 Lasso 0.5975859 0.3932411
Lasso provides the best model, with the lowest test RMSE and the highest R2.
\[ log(360 - \hat{months}) = 2.837 - 0.022{popularity} + 0.191{duration} - 0.240{acousticness} \\ + 0.409{danceability} + 0.533{energy} - 0.808{instrumentalness} \\ - 0.027{key1} + 0.004{key3} + 0.092{key4} + 0.024{key6} \\ - 0.056{key7} - 0.068{key8} + 0.082{key9} + 0.099{key10} + 0.077{key11} \\ - 0.084{loudness} - 0.280{speechiness} + 0.001{tempo} + 0.321{valence} \]
mlr1 <- lm(months ~. , data = train1)
Assess the model assumptions with diagnostic plots.
plot(mlr1)
View the model summary.
summary(mlr1)
##
## Call:
## lm(formula = months ~ ., data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48440 -0.39951 0.05378 0.45331 1.76459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9901418 0.1434374 20.846 < 2e-16 ***
## popularity -0.0198841 0.0005122 -38.818 < 2e-16 ***
## duration 0.1251172 0.0155343 8.054 9.73e-16 ***
## acousticness -0.1274822 0.0508609 -2.506 0.01222 *
## danceability 0.6314340 0.0873034 7.233 5.39e-13 ***
## energy 0.3826541 0.0973081 3.932 8.51e-05 ***
## instrumentalness -0.7706272 0.1212973 -6.353 2.28e-10 ***
## key1 0.0842570 0.0314613 2.678 0.00743 **
## key2 0.0287030 0.0336488 0.853 0.39369
## key3 0.0202750 0.0516321 0.393 0.69457
## key4 0.0680017 0.0436575 1.558 0.11938
## key5 0.0749496 0.0383254 1.956 0.05056 .
## key6 0.0957015 0.0375012 2.552 0.01074 *
## key7 0.1019746 0.0312840 3.260 0.00112 **
## key8 0.1189275 0.0362960 3.277 0.00106 **
## key9 0.0124663 0.0389359 0.320 0.74885
## key10 0.0810947 0.0453378 1.789 0.07372 .
## key11 0.0463297 0.0403927 1.147 0.25144
## loudness -0.0757560 0.0061732 -12.272 < 2e-16 ***
## speechiness -0.1799930 0.1442184 -1.248 0.21206
## tempo 0.0007324 0.0003286 2.229 0.02589 *
## valence 0.2936440 0.0531944 5.520 3.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6245 on 5482 degrees of freedom
## Multiple R-squared: 0.3206, Adjusted R-squared: 0.318
## F-statistic: 123.2 on 21 and 5482 DF, p-value: < 2.2e-16
The Full Multiple Linear Regression model for mode 1 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 2.990 - 0.020{popularity} + 0.125{duration} - 0.127{acousticness} \\ + 0.631{danceability} + 0.383{energy} - 0.771{instrumentalness} \\ + 0.084{key1} + 0.029{key2} + 0.020{key3} + 0.068{key4} + 0.075{key5} + 0.096{key6} \\ + 0.102{key7} + 0.119{key8} + 0.012{key9} + 0.081{key10} + 0.046{key11} \\ - 0.076{loudness} - 0.180{speechiness} + 0.001{tempo} + 0.294{valence} \]
# prediction on test data
yhat.mlr1 = predict(mlr1, newdata=test1)
# RMSE for test data
error.mlr1 <- yhat.mlr1 - test1$months
rmse.mlr1 <- sqrt(mean(error.mlr1^2))
data.frame(
RMSE = RMSE(yhat.mlr1, test1$months),
R2 = R2(yhat.mlr1, test1$months)
)
## RMSE R2
## 1 0.6078257 0.3312429
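The `RMSE` and `R2` helpers come from caret. The same quantities can be computed by hand, which clarifies what is being reported (a base-R sketch on made-up vectors; `R2` here is taken as the squared correlation between predictions and observations, matching caret's default):

```r
# Hand-rolled equivalents of the caret metric helpers
rmse_manual <- function(pred, obs) sqrt(mean((pred - obs)^2))
r2_manual   <- function(pred, obs) cor(pred, obs)^2

obs  <- c(1, 2, 3, 4, 5)
pred <- c(1.1, 1.9, 3.2, 3.8, 5.1)
rmse_manual(pred, obs)   # root mean squared prediction error
r2_manual(pred, obs)     # near 1 for well-aligned predictions
```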
set.seed(123)
train_control <- trainControl(method = "cv",
number = 10)
bye <- capture.output(mlr.step_kcv1 <- train(months ~. , data = train1,
method = "lmStepAIC", trControl = train_control))
print(mlr.step_kcv1)
## Linear Regression with Stepwise Selection
##
## 5504 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4953, 4953, 4955, 4954, 4953, 4954, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.6266852 0.3145441 0.5035447
mlr.step_kcv1$finalModel
##
## Call:
## lm(formula = .outcome ~ popularity + duration + acousticness +
## danceability + energy + instrumentalness + key1 + key5 +
## key6 + key7 + key8 + key10 + loudness + tempo + valence,
## data = dat)
##
## Coefficients:
## (Intercept) popularity duration acousticness
## 3.0299902 -0.0199787 0.1273096 -0.1239002
## danceability energy instrumentalness key1
## 0.6302868 0.3566622 -0.7653059 0.0616367
## key5 key6 key7 key8
## 0.0529274 0.0732353 0.0786565 0.0959522
## key10 loudness tempo valence
## 0.0597350 -0.0739348 0.0006867 0.2887687
The Multiple Linear Regression model resulting from stepwise selection with 10-fold cross-validation for mode 1 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 3.030 - 0.020{popularity} + 0.127{duration} - 0.124{acousticness} \\ + 0.630{danceability} + 0.357{energy} - 0.765{instrumentalness} + 0.062{key1} \\ + 0.053{key5} + 0.073{key6} + 0.079{key7} + 0.096{key8} + 0.060{key10} \\ - 0.074{loudness} + 0.001{tempo} + 0.289{valence} \]
Compared to the full multiple linear regression model, the stepwise regression leaves out the dummy variables key2, key3, key4, key9, and key11, as well as speechiness.
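caret's "lmStepAIC" method wraps MASS::stepAIC, which repeatedly adds or drops whichever term most improves AIC. The idea can be sketched with base R's step() on the built-in mtcars data (a stand-in dataset, since the kpop training data is not reproduced here):

```r
# AIC-based stepwise selection starting from the full model
full    <- lm(mpg ~ ., data = mtcars)
reduced <- step(full, direction = "both", trace = 0)

# step() keeps only terms whose removal would raise AIC
formula(reduced)
# -> mpg ~ wt + qsec + am
```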
# prediction on test data
yhat.step_kcv1 = predict(mlr.step_kcv1$finalModel, newdata=key.dummy(test1))
# RMSE for test data
error.step_kcv1 <- yhat.step_kcv1 - test1$months
rmse.step_kcv1 <- sqrt(mean(error.step_kcv1^2))
data.frame(
RMSE = RMSE(yhat.step_kcv1, test1$months),
R2 = R2(yhat.step_kcv1, test1$months)
)
## RMSE R2
## 1 0.6070693 0.3329448
allreg.models1 <- regsubsets(months ~., data = train1, nvmax = 21)
summary(allreg.models1)
## Subset selection object
## Call: regsubsets.formula(months ~ ., data = train1, nvmax = 21)
## 21 Variables (and intercept)
## Forced in Forced out
## popularity FALSE FALSE
## duration FALSE FALSE
## acousticness FALSE FALSE
## danceability FALSE FALSE
## energy FALSE FALSE
## instrumentalness FALSE FALSE
## key1 FALSE FALSE
## key2 FALSE FALSE
## key3 FALSE FALSE
## key4 FALSE FALSE
## key5 FALSE FALSE
## key6 FALSE FALSE
## key7 FALSE FALSE
## key8 FALSE FALSE
## key9 FALSE FALSE
## key10 FALSE FALSE
## key11 FALSE FALSE
## loudness FALSE FALSE
## speechiness FALSE FALSE
## tempo FALSE FALSE
## valence FALSE FALSE
## 1 subsets of each size up to 21
## Selection Algorithm: exhaustive
## popularity duration acousticness danceability energy instrumentalness
## 1 ( 1 ) "*" " " " " " " " " " "
## 2 ( 1 ) "*" " " " " "*" " " " "
## 3 ( 1 ) "*" " " " " " " " " " "
## 4 ( 1 ) "*" "*" " " " " " " " "
## 5 ( 1 ) "*" "*" " " "*" "*" " "
## 6 ( 1 ) "*" "*" " " "*" "*" "*"
## 7 ( 1 ) "*" "*" " " "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*"
## 20 ( 1 ) "*" "*" "*" "*" "*" "*"
## 21 ( 1 ) "*" "*" "*" "*" "*" "*"
## key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 loudness
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 5 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 6 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 7 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 8 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 9 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
## 10 ( 1 ) " " " " " " " " " " " " "*" "*" " " " " " " "*"
## 11 ( 1 ) " " " " " " " " " " " " "*" "*" " " " " " " "*"
## 12 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " "*"
## 13 ( 1 ) "*" " " " " " " " " "*" "*" "*" " " " " " " "*"
## 14 ( 1 ) "*" " " " " " " "*" "*" "*" "*" " " " " " " "*"
## 15 ( 1 ) "*" " " " " " " "*" "*" "*" "*" " " "*" " " "*"
## 16 ( 1 ) "*" " " " " "*" "*" "*" "*" "*" " " "*" " " "*"
## 17 ( 1 ) "*" " " " " "*" "*" "*" "*" "*" " " "*" " " "*"
## 18 ( 1 ) "*" " " " " "*" "*" "*" "*" "*" " " "*" "*" "*"
## 19 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" " " "*" "*" "*"
## 20 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*"
## 21 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## speechiness tempo valence
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " "*"
## 4 ( 1 ) " " " " "*"
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " "*"
## 8 ( 1 ) " " " " "*"
## 9 ( 1 ) " " "*" "*"
## 10 ( 1 ) " " " " "*"
## 11 ( 1 ) " " "*" "*"
## 12 ( 1 ) " " "*" "*"
## 13 ( 1 ) " " "*" "*"
## 14 ( 1 ) " " "*" "*"
## 15 ( 1 ) " " "*" "*"
## 16 ( 1 ) " " "*" "*"
## 17 ( 1 ) "*" "*" "*"
## 18 ( 1 ) "*" "*" "*"
## 19 ( 1 ) "*" "*" "*"
## 20 ( 1 ) "*" "*" "*"
## 21 ( 1 ) "*" "*" "*"
allreg.res1 <- summary(allreg.models1)
allreg.compare1 <- data.frame(model = c(1:21),
Adj.R2 = allreg.res1$adjr2,
CP = allreg.res1$cp)
allreg.compare1
## model Adj.R2 CP
## 1 1 0.2662577 419.01033
## 2 2 0.2809405 301.51148
## 3 3 0.2915207 217.13918
## 4 4 0.2984165 162.50349
## 5 5 0.3064961 98.34476
## 6 6 0.3114590 59.32933
## 7 7 0.3153303 29.12450
## 8 8 0.3160799 24.08108
## 9 9 0.3164721 21.91925
## 10 10 0.3169039 19.43997
## 11 11 0.3173029 17.22495
## 12 12 0.3175376 16.33495
## 13 13 0.3178574 14.76005
## 14 14 0.3179760 14.80510
## 15 15 0.3181035 14.77934
## 16 16 0.3181764 15.19312
## 17 17 0.3182498 15.60317
## 18 18 0.3182303 16.76046
## 19 19 0.3181744 18.21054
## 20 20 0.3180635 20.10251
## 21 21 0.3179518 22.00000
Across the candidate models, Mallows' Cp is minimized at model 13 (14.76) and adjusted R2 peaks at model 17 (0.3182), but both criteria are nearly flat from model 13 onward, so a model of roughly 14 predictors balances fit and parsimony; that model is fit below.
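Rather than reading the table by eye, each column's optimum can be located programmatically; note that Cp and adjusted R2 need not agree on a single subset size. A base-R sketch using the values printed above:

```r
# Reconstruct the comparison table printed above
allreg.compare1 <- data.frame(
  model  = 1:21,
  Adj.R2 = c(0.2662577, 0.2809405, 0.2915207, 0.2984165, 0.3064961,
             0.3114590, 0.3153303, 0.3160799, 0.3164721, 0.3169039,
             0.3173029, 0.3175376, 0.3178574, 0.3179760, 0.3181035,
             0.3181764, 0.3182498, 0.3182303, 0.3181744, 0.3180635,
             0.3179518),
  CP     = c(419.01033, 301.51148, 217.13918, 162.50349, 98.34476,
             59.32933, 29.12450, 24.08108, 21.91925, 19.43997,
             17.22495, 16.33495, 14.76005, 14.80510, 14.77934,
             15.19312, 15.60317, 16.76046, 18.21054, 20.10251,
             22.00000)
)
which.min(allreg.compare1$CP)      # smallest Mallows' Cp -> 13
which.max(allreg.compare1$Adj.R2)  # largest adjusted R^2 -> 17
```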
mlr.allreg1 <- lm(months ~ popularity +duration +acousticness +danceability +energy +instrumentalness+
key1+key5 +key6 +key7 +key8+loudness + tempo +valence, data = key.dummy(train1))
summary(mlr.allreg1)
##
## Call:
## lm(formula = months ~ popularity + duration + acousticness +
## danceability + energy + instrumentalness + key1 + key5 +
## key6 + key7 + key8 + loudness + tempo + valence, data = key.dummy(train1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48754 -0.40040 0.05556 0.45256 1.83006
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0334496 0.1417347 21.402 < 2e-16 ***
## popularity -0.0199882 0.0005048 -39.599 < 2e-16 ***
## duration 0.1277720 0.0153891 8.303 < 2e-16 ***
## acousticness -0.1219629 0.0506774 -2.407 0.016132 *
## danceability 0.6298047 0.0871300 7.228 5.56e-13 ***
## energy 0.3572279 0.0950323 3.759 0.000172 ***
## instrumentalness -0.7678735 0.1209540 -6.348 2.35e-10 ***
## key1 0.0567054 0.0264327 2.145 0.031974 *
## key5 0.0478245 0.0342038 1.398 0.162103
## key6 0.0682239 0.0333567 2.045 0.040874 *
## key7 0.0738130 0.0261831 2.819 0.004833 **
## key8 0.0910011 0.0319686 2.847 0.004436 **
## loudness -0.0740407 0.0060210 -12.297 < 2e-16 ***
## tempo 0.0006795 0.0003269 2.078 0.037717 *
## valence 0.2887193 0.0530373 5.444 5.45e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6245 on 5489 degrees of freedom
## Multiple R-squared: 0.3197, Adjusted R-squared: 0.318
## F-statistic: 184.3 on 14 and 5489 DF, p-value: < 2.2e-16
# prediction on test data
yhat.allreg1 = predict(mlr.allreg1, newdata=key.dummy(test1))
# RMSE for test data
error.allreg1 <- yhat.allreg1 - test1$months
rmse.allreg1 <- sqrt(mean(error.allreg1^2))
data.frame(
RMSE = RMSE(yhat.allreg1, test1$months),
R2 = R2(yhat.allreg1, test1$months)
)
## RMSE R2
## 1 0.6071187 0.3328339
lm.ridge1 <- train(months ~. , data = train1, method = "glmnet",
trControl = train_control,
tuneGrid = expand.grid(alpha = 0,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.ridge1$bestTune
## alpha lambda
## 38 0 0.038
# best coefficient
lm.ridge1.model <- coef(lm.ridge1$finalModel, lm.ridge1$bestTune$lambda)
lm.ridge1.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 3.1431295345
## popularity -0.0190324693
## duration 0.1221767160
## acousticness -0.1418149731
## danceability 0.5970842302
## energy 0.2791295082
## instrumentalness -0.6904944050
## key1 0.0657555894
## key2 0.0117781483
## key3 0.0010085253
## key4 0.0500939827
## key5 0.0563973003
## key6 0.0773915151
## key7 0.0810143099
## key8 0.0962618966
## key9 -0.0050702623
## key10 0.0647627234
## key11 0.0284046850
## loudness -0.0675138250
## speechiness -0.1577203949
## tempo 0.0006447657
## valence 0.2966971890
The Ridge Multiple Linear Regression model for mode 1 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 3.143 - 0.019{popularity} + 0.122{duration} - 0.142{acousticness} \\ + 0.597{danceability} + 0.279{energy} - 0.690{instrumentalness} \\ + 0.066{key1} + 0.012{key2} + 0.001{key3} + 0.050{key4} + 0.056{key5} + 0.077{key6} \\ + 0.081{key7} + 0.096{key8} - 0.005{key9} + 0.065{key10} + 0.028{key11} \\ - 0.068{loudness} - 0.158{speechiness} + 0.001{tempo} + 0.297{valence} \]
The ridge model retains all variables; none are removed. Rather than performing selection, ridge shrinks every coefficient toward zero.
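This shrinkage behavior can be illustrated with the closed-form ridge solution on simulated data (a base-R sketch unrelated to the kpop data; glmnet additionally standardizes predictors and handles the intercept separately):

```r
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, -1, 0.5, 0, 0) + rnorm(n)

# Closed-form ridge estimate: beta = (X'X + lambda I)^{-1} X'y
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_ols   <- ridge_coef(X, y, 0)    # ordinary least squares (lambda = 0)
b_ridge <- ridge_coef(X, y, 50)   # heavily penalized

# The penalty shrinks the coefficient vector toward zero,
# but no coefficient is set exactly to zero.
sum(b_ridge^2) < sum(b_ols^2)   # TRUE
all(b_ridge != 0)               # TRUE
```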
# prediction on test data
yhat.ridge1 = predict(lm.ridge1, s=lm.ridge1$bestTune,newdata=test1)
# RMSE for test data
data.frame(
RMSE = RMSE(yhat.ridge1, test1$months),
R2 = R2(yhat.ridge1, test1$months)
)
## RMSE R2
## 1 0.607864 0.332307
lm.lasso1 <- train(months ~. , data = train1, method = "glmnet",
trControl = train_control,
tuneGrid = expand.grid(alpha = 1,lambda = seq(0.001,0.1,by = 0.001)))
#Best Tuning Parameter
lm.lasso1$bestTune
## alpha lambda
## 1 1 0.001
# best coefficient
lm.lasso1.model <- coef(lm.lasso1$finalModel, lm.lasso1$bestTune$lambda)
lm.lasso1.model
## 22 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 3.0552331450
## popularity -0.0198737952
## duration 0.1234290019
## acousticness -0.1271194063
## danceability 0.6223931899
## energy 0.3536795015
## instrumentalness -0.7502129375
## key1 0.0663408994
## key2 0.0099288569
## key3 .
## key4 0.0472801198
## key5 0.0554833191
## key6 0.0769102693
## key7 0.0835834150
## key8 0.0997023087
## key9 .
## key10 0.0612161728
## key11 0.0269418262
## loudness -0.0733775550
## speechiness -0.1481066141
## tempo 0.0006778449
## valence 0.2918487107
The Lasso Multiple Linear Regression model for mode 1 songs is written below with coefficients rounded to the thousandths. \[ log(360 - \hat{months}) = 3.055 - 0.020{popularity} + 0.123{duration} - 0.127{acousticness} \\ + 0.622{danceability} + 0.354{energy} - 0.750{instrumentalness} \\ + 0.066{key1} + 0.010{key2} + 0.047{key4} + 0.055{key5} + 0.077{key6} \\ + 0.084{key7} + 0.100{key8} + 0.061{key10} + 0.027{key11} \\ - 0.073{loudness} - 0.148{speechiness} + 0.001{tempo} + 0.292{valence} \]
Compared to the full model, the lasso model omits the dummy variables key3 and key9.
# prediction on test data
yhat.lasso1 = predict(lm.lasso1, s=lm.lasso1$bestTune,newdata=test1)
# RMSE for test data
error.lasso1 <- yhat.lasso1 - test1$months
rmse.lasso1 <- sqrt(mean(error.lasso1^2))
data.frame(
RMSE = RMSE(yhat.lasso1, test1$months),
R2 = R2(yhat.lasso1, test1$months)
)
## RMSE R2
## 1 0.6073883 0.3322821
results1 <- data.frame(Model = c("FullMLR", "Stepwise", "AllSubsets","Ridge", "Lasso"),
RMSE = c(0.6078257, 0.6070693,0.6071187,0.607864,0.6073883),
R2 = c(0.3312429, 0.3329448, 0.3328339,0.332307,0.3322821))
results1
## Model RMSE R2
## 1 FullMLR 0.6078257 0.3312429
## 2 Stepwise 0.6070693 0.3329448
## 3 AllSubsets 0.6071187 0.3328339
## 4 Ridge 0.6078640 0.3323070
## 5 Lasso 0.6073883 0.3322821
Stepwise regression provides the best model for mode 1 songs, with the lowest test RMSE (0.6070693) and the highest R2 (0.3329448).
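The winner can be confirmed programmatically from the comparison table (a base-R sketch using the values printed above):

```r
# Rebuild the mode 1 comparison table
results1 <- data.frame(
  Model = c("FullMLR", "Stepwise", "AllSubsets", "Ridge", "Lasso"),
  RMSE  = c(0.6078257, 0.6070693, 0.6071187, 0.6078640, 0.6073883),
  R2    = c(0.3312429, 0.3329448, 0.3328339, 0.3323070, 0.3322821)
)
results1$Model[which.min(results1$RMSE)]  # "Stepwise"
results1$Model[which.max(results1$R2)]    # "Stepwise"
```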
summary(mlr.step_kcv1$finalModel)
##
## Call:
## lm(formula = .outcome ~ popularity + duration + acousticness +
## danceability + energy + instrumentalness + key1 + key5 +
## key6 + key7 + key8 + key10 + loudness + tempo + valence,
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4842 -0.4012 0.0563 0.4534 1.7748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0299902 0.1417423 21.377 < 2e-16 ***
## popularity -0.0199787 0.0005048 -39.580 < 2e-16 ***
## duration 0.1273096 0.0153911 8.272 < 2e-16 ***
## acousticness -0.1239002 0.0506910 -2.444 0.014548 *
## danceability 0.6302868 0.0871225 7.234 5.31e-13 ***
## energy 0.3566622 0.0950243 3.753 0.000176 ***
## instrumentalness -0.7653059 0.1209561 -6.327 2.70e-10 ***
## key1 0.0616367 0.0266563 2.312 0.020799 *
## key5 0.0529274 0.0343880 1.539 0.123832
## key6 0.0732353 0.0335388 2.184 0.029034 *
## key7 0.0786565 0.0264008 2.979 0.002902 **
## key8 0.0959522 0.0321543 2.984 0.002857 **
## key10 0.0597350 0.0419649 1.423 0.154663
## loudness -0.0739348 0.0060209 -12.280 < 2e-16 ***
## tempo 0.0006867 0.0003269 2.101 0.035717 *
## valence 0.2887687 0.0530324 5.445 5.40e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6244 on 5488 degrees of freedom
## Multiple R-squared: 0.32, Adjusted R-squared: 0.3181
## F-statistic: 172.1 on 15 and 5488 DF, p-value: < 2.2e-16
\[ log(360 - \hat{months}) = 3.030 - 0.020{popularity} + 0.127{duration} - 0.124{acousticness} \\ + 0.630{danceability} + 0.357{energy} - 0.765{instrumentalness} + 0.062{key1} \\ + 0.053{key5} + 0.073{key6} + 0.079{key7} + 0.096{key8} + 0.060{key10} \\ - 0.074{loudness} + 0.001{tempo} + 0.289{valence} \]